1 New York City Bagel Data

What are the characteristics of the best bagels in NYC, at least according to Highley Varlet at https://everythingiseverything.nyc/?

(Thanks again to Mike @ https://highleyvarlet.com/ for sharing their bagel data!)

2 Data Exploration

2.1 Borough Frequency

Out of 202 reviews, most took place in Brooklyn, Manhattan, and Queens.

2.2 Scores

In all categories, most of the reviews had scores of 3 to 4.5.

2.2.1 Scores by Borough

The distribution of scores was fairly similar between boroughs, though Manhattan had a higher concentration of reviews with higher Total scores. This seems to be driven by the overall higher Cheese scores and overall fewer middle-range Bagel scores in Manhattan.

2.3 Bagel Characteristics

The distribution of ratings was pretty uniform for most bagel characteristics. In other words, a good spread (ha ha) of different bagel types was covered in these reviews. The exception was bagel size: most of the reviewed bagels were larger as opposed to smaller.

2.3.1 Correlations between Characteristics

Were bagel characteristics correlated? For example, did large bagels also tend to be more salty, or did bagels with dairy-forward cream cheese tend to be in more contemporary stores?

To answer this question, we looked at the correlation coefficient between every pair of characteristics. Correlations closer to 1 or -1 signify strong positive or negative (linear) relationships, respectively. Correlations closer to 0 signify no (linear) relationship.

Surprisingly, the characteristics were not correlated, for the most part. We did note a moderately strong relationship between topping density and salt levels (correlation = 0.4), which makes sense.

3 Predicting Score

We wanted to see if the information in the bagel characteristics (bagel size, salt level, etc.) was predictive of the overall score. To do this, we considered a few simple prediction models, starting with tried-and-true methods:

  • Linear regression
  • Principal Components Analysis
  • Random Forest

Note: Future work will involve more sophisticated regression and machine learning models that can accommodate non-linearity between the characteristics and the total score (as the random forest does). As the total score was not quite Gaussian but rather a score from 1 to 5, it may be of interest to consider parametric regression models with ordered or multinomial outcomes. Finally, we are very interested in insights that we can gain from the available spatial and imaging data, such as: do higher-scoring bagels tend to cluster geographically?

3.1 Linear Regression (Characteristics Only)

First, we consider a model where only the bagel characteristics are used to predict total score.

Dependent variable: Score_Total

Predictor                 Coefficient   Std. Error
c_contemporary_classic    -0.003        (0.005)
c_variety_focused         -0.007        (0.005)
c_crackly_chewy            0.006        (0.008)
c_xlarge_small             0.027***     (0.008)
c_toppingdense_light       0.005        (0.006)
c_highsalt_low             0.019***     (0.006)
c_finescallion_coarse     -0.008        (0.006)
c_dairyfrwrd_latent       -0.025***     (0.006)
Constant                   3.638***     (0.057)

Observations              198
R2                        0.225
Adjusted R2               0.192
Residual Std. Error       0.460 (df = 189)
F Statistic               6.843*** (df = 8; 189)

Note: *p<0.1; **p<0.05; ***p<0.01

The most important predictors of total score were:

  • Bagel size (larger bagels ~ higher score)
  • Salt level (saltier bagels ~ higher score)
  • Dairy level of cream cheese (dairy-forward ~ lower score)

Note: overall R-squared was low (0.225, adjusted 0.192): the characteristics explain less than a quarter of the variation in total score, so this model does not do the best job at predicting score.

3.2 Linear Regression (Characteristics + Borough)

Next, we consider a model where the bagel characteristics and borough are used to predict total score.

Dependent variable: Score_Total

Predictor                 Coefficient   Std. Error
c_contemporary_classic    -0.006        (0.006)
c_variety_focused         -0.007        (0.005)
c_crackly_chewy            0.007        (0.008)
c_xlarge_small             0.029***     (0.008)
c_toppingdense_light       0.004        (0.006)
c_highsalt_low             0.020***     (0.005)
c_finescallion_coarse     -0.008        (0.007)
c_dairyfrwrd_latent       -0.023***     (0.006)
BoroughBrooklyn            0.275*       (0.150)
BoroughManhattan           0.293*       (0.152)
BoroughQueens              0.206        (0.154)
BoroughStaten Island       0.058        (0.176)
Constant                   3.406***     (0.145)

Observations              198
R2                        0.252
Adjusted R2               0.204
Residual Std. Error       0.457 (df = 185)
F Statistic               5.207*** (df = 12; 185)

Note: *p<0.1; **p<0.05; ***p<0.01

The most important predictors of total score were:

  • Bagel size (larger bagels ~ higher score)
  • Salt level (saltier bagels ~ higher score)
  • Dairy level of cream cheese (dairy-forward ~ lower score)
  • Borough (Brooklyn and Manhattan bagels ~ higher score than the omitted reference borough, though only at the p<0.1 level)

Note: overall R-squared was again low (0.252, adjusted 0.204), so adding borough barely improves the fit and this model still does not do the best job at predicting score.
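The borough rows in the table above are indicator (dummy) variables: each borough except one reference level gets its own 0/1 column, and its coefficient is the difference from that reference. A minimal sketch of that coding follows. The function name `dummy_code` and the example list are hypothetical, and treating the Bronx as the reference is an assumption based only on it being the one borough without its own row.

```python
def dummy_code(values, reference, prefix="Borough"):
    """Treatment-code a categorical variable.

    Returns one 0/1 column per level except `reference`, keyed as prefix + level.
    """
    levels = sorted(set(values) - {reference})
    return {
        prefix + level: [1 if v == level else 0 for v in values]
        for level in levels
    }


# Hypothetical reviews (NOT the real data), one borough per review.
boroughs = ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island", "Brooklyn"]

# Assumption: the Bronx is the reference level.
dummies = dummy_code(boroughs, reference="Bronx")

for name, column in dummies.items():
    print(name.ljust(22), column)
```

A review from the reference borough has 0 in every column, so its prediction uses the constant alone. Under this coding, the Manhattan coefficient of 0.293 reads as "0.293 points higher than the reference borough, holding the bagel characteristics fixed".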

3.3 Principal Components Analysis

Although bagels were rated along 8 different characteristics, these ratings may reflect a smaller number of underlying, latent characteristics. Principal components analysis (PCA) looks for these: each principal component (PC) is a weighted combination of the 8 characteristics, and the weights (loadings) show which characteristics move together. Bagels that score high on a PC can be thought of as a latent subgroup described by that combination:

## 
## Loadings:
##                        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## c_contemporary_classic  0.218  0.196  0.862  0.280  0.235         0.142       
## c_variety_focused              0.958 -0.156 -0.169        -0.125              
## c_crackly_chewy               -0.123 -0.116 -0.152                0.960  0.106
## c_xlarge_small         -0.123                      -0.271                0.948
## c_toppingdense_light   -0.533         0.148  0.146  0.189 -0.791              
## c_highsalt_low         -0.803         0.179                0.539  0.102       
## c_finescallion_coarse                       -0.510  0.789        -0.190  0.248
## c_dairyfrwrd_latent                  -0.404  0.761  0.446  0.205              
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.125  0.125  0.125  0.125  0.125  0.125  0.125  0.125
## Cumulative Var  0.125  0.250  0.375  0.500  0.625  0.750  0.875  1.000

Reading off the larger loadings, high scores on the first 5 PCs correspond to:

  • PC 1: contemporary stores with smaller bagels that have lighter topping density and very low salt.
  • PC 2: contemporary stores with a much larger variety of offerings.
  • PC 3: very contemporary stores with dairy-latent cream cheese.
  • PC 4: contemporary stores with focused offerings, chewy bagels, and dairy-forward cream cheese with very coarse scallions.
  • PC 5: contemporary stores with small bagels and dairy-forward cream cheese with very fine scallions.
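For readers who want to see where loadings come from, here is a minimal PCA sketch in plain Python. It finds the leading eigenvectors of the covariance matrix by power iteration, a simple stand-in for the routine that produced the loadings above. The function names and the five toy bagels are hypothetical, and whether the original analysis standardized the ratings first is not stated, so this sketch works on the raw covariance matrix.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(k)]
        for i in range(k)
    ]


def top_component(C, iters=500):
    """Leading eigenvector and eigenvalue of a symmetric matrix by power iteration."""
    k = len(C)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Cv = [sum(C[i][j] * v[j] for j in range(k)) for i in range(k)]
    eigenvalue = sum(a * b for a, b in zip(v, Cv))
    return v, eigenvalue


def principal_components(rows, n_components):
    """Return [(loadings, variance), ...] for the first n_components PCs."""
    C = covariance_matrix(rows)
    k = len(C)
    components = []
    for _ in range(n_components):
        v, lam = top_component(C)
        components.append((v, lam))
        # Deflate: subtract the variance already captured by this component.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(k)] for i in range(k)]
    return components


# Hypothetical ratings (NOT the review data): topping density, salt, size.
toy = [[1, 2, 3], [2, 1, 3], [3, 4, 2], [4, 3, 4], [5, 5, 3]]

total_variance = sum(covariance_matrix(toy)[i][i] for i in range(3))
components = principal_components(toy, n_components=2)
for idx, (loadings, variance) in enumerate(components, start=1):
    share = variance / total_variance
    print(f"PC {idx}:", [round(x, 3) for x in loadings], f"share={share:.3f}")
```

In this made-up data, topping density and salt rise together, so the first PC loads equally on both and barely on size. The sign of a PC is arbitrary: flipping every loading describes the same component, which is worth keeping in mind when reading the descriptions above.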

3.3.1 Can PCs Predict Score?

Instead of predicting the score using the 8 bagel features, we consider predicting the score using the first 5 PCs described above.

Dependent variable: Score_Total

Predictor                 Coefficient   Std. Error
Comp.1                    -0.019***     (0.004)
Comp.2                    -0.007        (0.005)
Comp.3                     0.012**      (0.005)
Comp.4                    -0.015**      (0.006)
Comp.5                    -0.026***     (0.006)
Constant                   3.753***     (0.033)

Observations              198
R2                        0.198
Adjusted R2               0.177
Residual Std. Error       0.465 (df = 192)
F Statistic               9.501*** (df = 5; 192)

Note: *p<0.1; **p<0.05; ***p<0.01

The most important predictors of total score were:

  • PC 1: these stores tend to score lower.
  • PC 3: these stores tend to score higher.
  • PC 4: these stores tend to score lower.
  • PC 5: these stores tend to score lower.

Overall, these findings suggest that the cream cheese probably impacts the scores the most, and that more dairyness is often associated with lower scores.
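To read a PC coefficient, it helps to see how a single bagel's ratings turn into a PC score and then into a change in predicted score. The sketch below uses the Comp.1 loadings and the Comp.1 coefficient reported above. Several things are assumptions: blank cells in the loadings table are treated as 0, the ratings are taken to be mean-centered before scoring, and the example bagel (10 rating points saltier than average) is hypothetical, since the rating scale is not stated here.

```python
# Comp.1 loadings copied from the loadings table above. Blank cells in that
# table are treated as 0 here, which is an approximation.
comp1_loadings = {
    "c_contemporary_classic": 0.218,
    "c_xlarge_small": -0.123,
    "c_toppingdense_light": -0.533,
    "c_highsalt_low": -0.803,
}

# Comp.1 coefficient from the PC regression table above.
comp1_coefficient = -0.019


def pc_score(centered_ratings, loadings):
    """Weighted sum of mean-centered ratings; characteristics not listed count as 0."""
    return sum(w * centered_ratings.get(name, 0.0) for name, w in loadings.items())


# Hypothetical bagel: 10 rating points saltier than average, average otherwise.
bagel = {"c_highsalt_low": 10.0}

score = pc_score(bagel, comp1_loadings)
contribution = comp1_coefficient * score

print(f"PC 1 score: {score:.2f}")
print(f"Contribution to predicted Score_Total: {contribution:+.3f}")
```

The saltier bagel gets a negative PC 1 score (PC 1 describes low salt), and the negative Comp.1 coefficient turns that into a small positive contribution of about 0.15 points. This matches the direction of the salt effect in the earlier regressions.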

3.4 Correlations with Score

The correlations between each bagel characteristic and the scores also give insight into the strength and direction of these relationships.

Correlations greater than 0 signify a positive relationship (e.g., the higher the salt, the higher the bagel score); correlations less than 0 signify a negative relationship (e.g., the more dairy-forward the cream cheese, the lower the cheese score).

3.5 Random Forest

The random forest is a machine learning algorithm that models a flexible, non-linear relationship between the predictors (bagel characteristics) and outcome (score) through an ensemble of regression trees.

To understand which predictors are the most important, we considered importance measured in %IncMSE, which quantifies the increase in error (MSE) after permuting that predictor over all trees within a random forest model. Higher values correspond to higher importance.

Characteristic                              %IncMSE
High Salt (vs. Lower)                       13.49
Contemporary Store (vs. Classic)            12.83
Dairy-Forward Cream Cheese                  11.16
Large Bagels (vs. Smaller)                  9.051
Topping Dense (vs. Less Dense)              6.01
High Variety Store (vs. Focused)            2.141
Crackly Bagel (vs. Chewier)                 0.7917
Fine Scallion Cream Cheese (vs. Coarser)    -0.4391

We find that salt level, store style (contemporary vs. classic), and cream cheese dairyness were again most important, followed by bagel size and topping density.
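The permutation idea behind this importance measure can be sketched in a few lines of plain Python. This is not a random forest: a fixed stand-in prediction function takes the place of the fitted model, and the function names and toy data are hypothetical. A random forest typically applies the same idea tree by tree on held-out (out-of-bag) reviews and may rescale the result, so the numbers here are not comparable to the %IncMSE values above.

```python
import random


def mse(predict, X, y):
    """Mean squared error of `predict` over the rows of X."""
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, col, n_repeats=20, seed=0):
    """Average increase in MSE after shuffling column `col` of X."""
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        increases.append(mse(predict, X_perm, y) - baseline)
    return sum(increases) / n_repeats


def stand_in_model(row):
    """Stand-in for a fitted model: uses salt (column 0), ignores scallion (column 1)."""
    return 3 + 0.5 * row[0]


# Hypothetical toy data (NOT the review data): [salt, scallion fineness].
X = [[1, 4], [2, 6], [3, 1], [4, 5], [5, 2], [6, 3]]
y = [stand_in_model(row) for row in X]

salt_importance = permutation_importance(stand_in_model, X, y, col=0)
scallion_importance = permutation_importance(stand_in_model, X, y, col=1)

print(f"salt:     {salt_importance:.3f}")
print(f"scallion: {scallion_importance:.3f}")
```

Shuffling a predictor the model relies on breaks its link to the score and raises the error, while shuffling one the model ignores changes nothing. With a real fitted model and noisy data, an unimportant predictor can come out slightly negative by chance, which is the most natural reading of the -0.4391 for scallion fineness above.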

4 Acknowledgements

Thanks again to Mike for providing the bagel review data, and to Nick Illenberger (NYU) and Sarah Weinstein (Penn) for their assistance in these analyses.